A Comparison of Reinforcement Learning Methods for Automatic Guided Vehicle Scheduling

Author

  • DoKyeong Ok
Abstract

Automatic Guided Vehicles (AGVs) are increasingly being used in manufacturing plants for transportation tasks. Optimal scheduling of AGVs is a difficult problem. A learning AGV is very attractive in a manufacturing plant, since it is hard to manually optimize the scheduling algorithm for each new situation. In this paper we compare four reinforcement learning methods for scheduling AGVs. Q-learning [Watkins and Dayan 92] and R-learning [Schwartz 93] do not use action models. Q-learning optimizes the discounted total reward, while R-learning optimizes the average undiscounted reward per step. ARTDP [Barto et al., to appear] is a discounted method that uses action models. H-learning [Tadepalli and Ok 94] is an undiscounted version of ARTDP based on an algorithm of Jalali and Ferguson [Jalali and Ferguson 89].

In our domain (see Figure 1), there are two queues generating jobs, an AGV, a moving obstacle, and two lanes. Queue 1 generates jobs for lane 2 half the time, and Queue 2 always generates lane 1 jobs. The task of the AGV is to move jobs from the queues to their destination lanes while avoiding collisions with the obstacle, which randomly moves up and down. There are a total of 540 states. At any time the AGV may do nothing, load, unload, or move up, down, left, or right. The goal is to maximize the average reward per step.

An experiment compared Q-learning, R-learning, ARTDP, and H-learning in our AGV domain. The reward is -5 when the AGV collides with the obstacle, +5 when it unloads a job to lane 1, and +1 when it unloads a job to lane 2. Figure 1 shows the medians of the average reward per step over 30 trials, evaluated separately after turning off learning at various stages. To enable exploration, 50% of the time a randomly chosen action was executed during learning. The parameters of ARTDP, Q-learning, and R-learning were tuned to this domain by trial and error. ARTDP with discount factor γ = 0.9 and Q-learning could not converge to the optimal policy even after 2 million steps.
Even though R-learning and ARTDP with high γ converged to the ...
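The distinction the abstract draws between Q-learning and R-learning can be sketched as one-step update rules (a minimal tabular illustration, not the paper's implementation; the function names, table layout, and step sizes alpha and beta are assumptions):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning backup: optimizes the discounted total reward."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def r_update(R, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    """One R-learning backup (after Schwartz 93): optimizes the average
    undiscounted reward per step. rho is the running average-reward
    estimate; the caller keeps the returned value."""
    best_next = max(R[(s_next, b)] for b in actions)
    R[(s, a)] += alpha * (r - rho + best_next - R[(s, a)])
    # rho is adjusted only when the executed action looks greedy,
    # i.e. on non-exploratory steps
    if R[(s, a)] >= max(R[(s, b)] for b in actions):
        rho += beta * (r - rho + best_next - R[(s, a)])
    return rho
```

Note that gamma appears only in the Q-learning backup; R-learning replaces discounting with the subtraction of rho, which is why it directly targets the average-reward criterion the AGV task uses.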


Similar Articles

Model-Based Average Reward Reinforcement Learning

Reinforcement Learning (RL) is the study of programs that improve their performance by receiving rewards and punishments from the environment. Most RL methods optimize the discounted total reward received by an agent, while, in many domains, the natural criterion is to optimize the average reward per time step. In this paper, we introduce a model-based Average-reward Reinforcement Learning meth...
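The model-based, average-reward idea described above can be sketched as a value backup over learned action models (a simplified, H-learning-style illustration; the table layout and names are assumptions, not the paper's code):

```python
from collections import defaultdict

def h_backup(h, rho, s, actions, reward, trans):
    """One average-reward value backup using learned action models.
    reward[(s, a)]: estimated immediate reward of action a in state s.
    trans[(s, a)]: dict mapping next states to estimated probabilities.
    rho: current estimate of the average reward per step."""
    best = max(
        reward[(s, a)] + sum(p * h[s2] for s2, p in trans[(s, a)].items())
        for a in actions
    )
    # subtracting rho replaces discounting: values measure relative
    # advantage over the long-run average reward
    h[s] = best - rho
```

In a model-based method the tables `reward` and `trans` are themselves estimated from experience (for example by counting observed transitions), and rho is updated from the rewards received on greedy steps.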


Scaling Up Average Reward Reinforcement Learning by Approximating the Domain Models and the Value Function

Almost all the work in Average-reward Reinforcement Learning (ARL) so far has focused on table-based methods which do not scale to domains with large state spaces. In this paper, we propose two extensions to a model-based ARL method called H-learning to address the scale-up problem. We extend H-learning to learn action models and reward functions in the form of Bayesian networks, and approxima...


Scheduling of Multiple Autonomous Guided Vehicles for an Assembly Line Using Minimum Cost Network Flow

This paper proposes a parallel automated assembly line system to produce multiple products using multiple autonomous guided vehicles (AGVs). Several assembly lines are configured to produce multiple products, in which the technologies of machines are shared among the assembly lines when required. The transportation between the stations in an assembly line (intra assembly line) and among station...


Operation Scheduling of MGs Based on Deep Reinforcement Learning Algorithm

In this paper, the operation scheduling of Microgrids (MGs), including Distributed Energy Resources (DERs) and Energy Storage Systems (ESSs), is proposed using a Deep Reinforcement Learning (DRL) based approach. Due to the dynamic characteristic of the problem, it is first formulated as a Markov Decision Process (MDP). Next, the Deep Deterministic Policy Gradient (DDPG) algorithm is presented t...


Low-Area/Low-Power CMOS Op-Amps Design Based on Total Optimality Index Using Reinforcement Learning Approach

This paper presents the application of reinforcement learning to automatic analog IC design. In this work, a Multi-Objective approach based on Learning Automata is evaluated for accommodating the required functionalities and performance specifications while optimally minimizing MOSFET area and power consumption for two well-known CMOS op-amps. The results show the ability of the proposed method to ...




Publication date: 1994